feat(filemanager): ingest_id tagging and object move tracking #585

Merged
merged 8 commits into from
Oct 4, 2024

Conversation

mmalenic
Member

@mmalenic mmalenic commented Oct 3, 2024

Closes #584

Mechanism

  • Introduces a mechanism by which the filemanager can track how an object moves by using S3 tags.
    • For each object, a tag is added to S3 to track the ingest_id.
    • The ingest id gets copied to the database as well.
    • When an object is moved with tags, the ingest_id is reused, which allows creating the sequence of records representing the moved object.
    • Attributes are also copied as the object moves.
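The mechanism above can be sketched roughly as follows. This is an in-memory illustration only, not the filemanager's actual code: `FakeS3`, `Record`, `ingest_created`, the list used as a database, and the `portal_run_id` attribute are hypothetical stand-ins, and the tag key is assumed to be `ingest_id` to match the column name.

```python
import uuid
from dataclasses import dataclass, field

INGEST_TAG = "ingest_id"  # assumed tag key, matching the database column name


@dataclass
class FakeS3:
    """Hypothetical in-memory stand-in for S3 object tags, keyed by (bucket, key)."""

    tags: dict = field(default_factory=dict)

    def get_object_tagging(self, bucket, key):
        return dict(self.tags.get((bucket, key), {}))

    def put_object_tagging(self, bucket, key, tag_set):
        self.tags[(bucket, key)] = dict(tag_set)

    def move(self, bucket, src, dst):
        # A move that preserves tags: the tags follow the object to the new key.
        self.tags[(bucket, dst)] = self.tags.pop((bucket, src), {})


@dataclass
class Record:
    bucket: str
    key: str
    ingest_id: str
    attributes: dict


def ingest_created(s3, db, bucket, key, attributes=None):
    """Handle a Created event: reuse or assign an ingest_id, then store a record."""
    tags = s3.get_object_tagging(bucket, key)
    ingest_id = tags.get(INGEST_TAG)
    if ingest_id is None:
        # First time this object is seen: tag it with a fresh id.
        ingest_id = str(uuid.uuid4())
        tags[INGEST_TAG] = ingest_id
        s3.put_object_tagging(bucket, key, tags)

    # Copy attributes from the latest record that shares this ingest_id.
    previous = [r for r in db if r.ingest_id == ingest_id]
    merged = dict(previous[-1].attributes) if previous else {}
    merged.update(attributes or {})

    record = Record(bucket, key, ingest_id, merged)
    db.append(record)
    return record


s3, db = FakeS3(), []
first = ingest_created(s3, db, "bucket", "a.txt", {"portal_run_id": "x"})
s3.move("bucket", "a.txt", "moved/a.txt")
second = ingest_created(s3, db, "bucket", "moved/a.txt")
```

After the move, `second` shares `first`'s `ingest_id` and inherits its attributes, which is what lets the two records be read as one moved object. A copy that keeps the tags would produce the same result, so this sketch cannot tell a move from a copy either.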

Implementation

  • Added GetObjectTagging and PutObjectTagging capabilities to filemanager.
  • Added ingest_id column to database which matches the ingest_id on the S3 object tags.
  • Test cases for move logic.

@mmalenic mmalenic self-assigned this Oct 3, 2024
@mmalenic mmalenic added the filemanager (an issue relating to the filemanager) and feature (New feature) labels Oct 3, 2024
@reisingerf
Member

Did you test that with any of the BYOB buckets?
I suppose this will require adjustment to the access permissions / bucket policies. Do you want to PR that against the infrastructure repo?
(i.e. this statement for each BYOB)

@mmalenic
Member Author

mmalenic commented Oct 3, 2024

> Did you test that with any of the BYOB buckets?
> I suppose this will require adjustment to the access permissions / bucket policies. Do you want to PR that against the infrastructure repo?

Good point, I haven't tested on BYOB - I'll PR that on the infra repo.
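
For context, the kind of permission adjustment discussed here might look roughly like the sketch below. The statement in the infrastructure repo is not reproduced in this thread, so everything here is an assumption for illustration: the helper name `tagging_statement`, the `Sid`, the bucket name, and the role ARN are all hypothetical. Only the two IAM action names correspond to the GetObjectTagging and PutObjectTagging capabilities added in this PR.

```python
import json


def tagging_statement(bucket: str, principal_arn: str) -> dict:
    """Build a bucket policy statement allowing object tag reads and writes.

    Hypothetical helper: the Sid and the arguments are placeholders.
    """
    return {
        "Sid": "FilemanagerObjectTagging",
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        # Actions matching the GetObjectTagging / PutObjectTagging API calls.
        "Action": ["s3:GetObjectTagging", "s3:PutObjectTagging"],
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }


statement = tagging_statement(
    "example-byob-bucket",  # placeholder bucket name
    "arn:aws:iam::111122223333:role/example-filemanager-role",  # placeholder role
)
policy_json = json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)
```

If tags are read or written on specific object versions, the versioned variants of these actions (`s3:GetObjectVersionTagging`, `s3:PutObjectVersionTagging`) would probably be needed as well.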

the same object that has been moved, or two different objects. This is because S3 only tracks `Created` or `Deleted`
events.

To track moved objects, the filemanager stores additional information in S3 tags, that gets copied when the object
Member

doc nit:

"stores additional information in S3 tags. The tag field X gets updated when the object is moved."

Member

and/or "see below for tag key-value details"

Member

@reisingerf reisingerf left a comment

Sounds good to me

The object tagging mechanism also doesn't differentiate between moved objects and copied objects with the same tags.
If an object is copied with tags, the `ingest_id` will also be copied and the above logic will apply.

## Alternative designs
Member

Nice doc!

I'd probably add another small note on the checksum approach: it can't be used if the checksums are not expected to be the same, e.g. with compression, which is a big use case for us.
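
A small illustration of that point, assuming SHA-256 as the checksum and gzip as the compression (both are stand-ins chosen for the example, not something this PR specifies):

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


original = b"ACGT" * 1024  # placeholder file contents
compressed = gzip.compress(original)

# The stored bytes change on compression, so the checksums no longer match...
checksums_match = sha256_hex(original) == sha256_hex(compressed)
# ...even though the compressed object still holds the same content.
round_trips = gzip.decompress(compressed) == original
```

Here `checksums_match` is `False` while `round_trips` is `True`: a checksum comparison would treat the compressed object as unrelated to the original, whereas a tag carried along with the object does not depend on the bytes at all.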

The new tag is also stored in the `ingest_id` column.
* The database is also queried for any records with the same `ingest_id` so that attributes can be copied to the new record.

This logic is enabled by default, but it can be switched off by setting `FILEMANAGER_INGESTER_TRACK_MOVES`. The filemanager
Member

Tagging is currently part of the ingestion process, right?
So there's a possibility that this may slow down the ingestion and may become an issue under heavy load?
Not for now, but if that should become the case, we could think of an async tagging strategy.
Given the option to disable tagging (or tagging failing/missing for other reasons), it would be great to think of an async tagging option.

Member Author

@mmalenic mmalenic Oct 4, 2024

Yeah, I agree with this: it could potentially slow things down as it's done on ingestion. It's a bit tricky because the act of tagging the object conveys the information of the move - ideally this would be done as soon as possible (i.e. on ingestion). Anything async would extend the window in which the object isn't tagged, meaning that the move can't be tracked. In practice this probably wouldn't make a difference if the object isn't moved as soon as it's created.

There are s3:ObjectTagging:* events which might be good for this that I'll look into.

Member

This is always a challenging tradeoff. Let's give it a shot!

Member

Yes, tricky, but that's a general "issue" with event-based systems: there's an inevitable delay/asynchronicity.

And I am not saying we should implement that now. An open ticket or comment in the code to keep track of it is perfectly fine.

To compensate for potential concurrency issues, the mentioned support of checksums, name matches, etc. could be used... at least to some extent. All future considerations... all good for now!

@mmalenic mmalenic merged commit 643b0af into main Oct 4, 2024
6 checks passed
@mmalenic mmalenic deleted the feat/filemanager-tagging branch October 4, 2024 05:09
Labels
feature New feature filemanager an issue relating to the filemanager
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(filemanager): track how an object moves
4 participants